Machine learning-based models for prediction of the risk of stroke in coronary artery disease patients receiving coronary revascularization

Background To construct several prediction models for the risk of stroke in coronary artery disease (CAD) patients receiving coronary revascularization based on machine learning methods. Methods In total, 5757 CAD patients receiving coronary revascularization who were admitted to the ICU in the Medical Information Mart for Intensive Care IV (MIMIC-IV) were included in this cohort study. All the data were randomly split into the training set (n = 4029) and testing set (n = 1728) at a ratio of 7:3. Pearson correlation analysis and least absolute shrinkage and selection operator (LASSO) regression were applied for feature screening. Variables with Pearson correlation coefficient <0.9 were included in the LASSO regression, in which the regression coefficients of uninformative variables were set to 0. Features more closely related to the outcome were selected by 10-fold cross-validation, and features with non-zero coefficients were retained and included in the final model. The predictive values of the models were evaluated by sensitivity, specificity, area under the curve (AUC), accuracy, and 95% confidence interval (CI). Results The Catboost model presented the best predictive performance with an AUC of 0.831 (95%CI: 0.811–0.851) in the training set, and 0.760 (95%CI: 0.722–0.798) in the testing set. The AUC of the logistic regression model was 0.789 (95%CI: 0.764–0.814) in the training set and 0.731 (95%CI: 0.686–0.776) in the testing set. The results of the Delong test revealed that the predictive value of the Catboost model was significantly higher than that of the logistic regression model (P<0.05). Charlson Comorbidity Index (CCI) was the most important variable associated with the risk of stroke in CAD patients receiving coronary revascularization. Conclusion The Catboost model was the optimal model for predicting the risk of stroke in CAD patients receiving coronary revascularization, and might provide a tool to quickly identify CAD patients who were at high risk of postoperative stroke.


Introduction
Coronary artery disease (CAD) is the most common cardiovascular disease, in which atherosclerosis occurs in one or more of the coronary arteries [1]. CAD was reported to be one of the major causes of mortality in both developed and developing countries [2]. Currently, percutaneous coronary intervention (PCI) and coronary artery bypass grafting (CABG) are common coronary revascularization procedures [3]. With the development and application of drug-eluting stents and minimally invasive surgery, the prognosis of patients undergoing PCI or CABG has improved, but some patients still have postoperative adverse cardiovascular events, which result in worse prognosis [4]. Stroke is a cerebrovascular disorder and the second leading cause of mortality and morbidity worldwide [5]. Stroke is a prevalent complication among surgical and ICU patients, with postoperative stroke incidence in cardiac surgery patients ranging from 0.8% to 9% [6]. Previous evidence suggested that the occurrence of stroke was associated with a significantly elevated risk of mortality in patients undergoing PCI or CABG procedures [7,8]. Constructing predictive models to accurately identify patients receiving coronary revascularization who are at high risk of stroke is therefore of great significance.
Recently, machine learning methods have been gradually applied to the construction of clinical models in order to improve the accuracy of clinical diagnosis or prediction of diseases [9]. Machine learning methods have been widely applied in predicting poor prognosis after heart surgery and the risk of postoperative stroke, and presented better performance than traditional risk models such as logistic regression [10][11][12]. However, no studies have reported the use of machine learning to predict the risk of postoperative stroke in patients undergoing coronary revascularization.
This study intended to construct several prediction models for the risk of stroke in CAD patients who underwent coronary revascularization based on machine learning methods. The optimal prediction model was identified and its predictive value was compared with that of the traditional logistic regression model.

Study design and population
In this cohort study, the records of 6289 CAD patients receiving coronary revascularization were obtained from the Medical Information Mart for Intensive Care IV (MIMIC-IV). MIMIC-IV builds upon the success of MIMIC-III and incorporates numerous enhancements, with data spanning 2008 to 2019. MIMIC-IV is a relational database that encompasses authentic hospitalizations of patients admitted to a tertiary academic medical center located in Boston, MA, USA. Each patient's length of stay, laboratory tests, medication treatment, vital signs and other comprehensive information during their ICU stay were recorded [13]. Patients aged <18 years and those with a length of ICU stay of less than 24 h were excluded. Finally, 5757 participants were included. The requirement of ethical approval for this study was waived by the Institutional Review Board of The Second Hospital of Dalian Medical University, because the data were accessed from MIMIC-IV (a publicly available database). The need for written informed consent was waived by the Institutional Review Board of The Second Hospital of Dalian Medical University due to the retrospective nature of the study.

Construction of the prediction models
The logistic regression model is a classification algorithm that evolved from linear regression. It belongs to the family of generalized linear models and uses the Sigmoid function to normalize the linear predictor into a probability. It is commonly used to solve binary classification problems and has strong explanatory ability [14].
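The Sigmoid normalization described above can be sketched in a few lines. This is a minimal illustration only: the weights, intercept and feature values are hypothetical and are not coefficients from the fitted model.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, intercept):
    """Probability of the positive class for one sample."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for two standardized features (illustration only).
weights = [0.8, -0.5]
intercept = -2.0

p = predict_proba([1.0, 0.0], weights, intercept)
label = int(p >= 0.5)  # positive class when the probability reaches 0.5
```

Because each weight acts on the log-odds, its sign and size can be read directly, which is the explanatory ability referred to above.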
Support vector machine (SVM) is a classification algorithm that can also be used for regression. Different models can be built according to different input data. If the input label is a categorical value, SVC() is used for classification. This algorithm improves the generalization ability of the learning machine by seeking the minimum structural risk, and minimizes the empirical risk and confidence range. Its basic model is defined as the linear classifier with the largest margin in the feature space; that is, the learning strategy of the support vector machine is to maximize the margin, which can be converted into the solution of a convex quadratic programming problem [15].
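To make the notion of the largest margin concrete, the sketch below computes the geometric margin of two candidate separating hyperplanes on toy data. The data and hyperplanes are hypothetical, and the code only evaluates margins; it does not solve the quadratic programming problem itself.

```python
import math

def geometric_margin(points, labels, w, b):
    """Smallest signed distance from any sample to the hyperplane w.x + b = 0.

    Labels are +1 / -1; a positive result means every sample lies on its correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Toy, linearly separable data (hypothetical values).
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
labels = [-1, -1, 1, 1]

# Two candidate separating hyperplanes; a maximum-margin classifier prefers the wider one.
margin_centered = geometric_margin(points, labels, w=(1.0, 0.0), b=-2.0)  # boundary at x = 2
margin_offset = geometric_margin(points, labels, w=(1.0, 0.0), b=-1.5)    # boundary at x = 1.5
```

Both hyperplanes classify every point correctly, but the centered one leaves twice the margin, so it is the one the maximum-margin criterion would select.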
Random forest is an extended variant of Bagging ensemble learning. On the basis of constructing a Bagging ensemble with the decision tree as the base learner, it further adds random attribute selection to the training process of the decision tree. For each node of the base decision tree, a subset containing k attributes is randomly selected from the candidate attribute set of the node, and then the optimal attribute in this subset is selected for splitting. The prediction stage of this algorithm follows the Bagging strategy: the classification model uses voting to determine the final result, and the regression model uses the mean to determine the final result [16].
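The three ingredients named above (bootstrap sampling, a random subset of k attributes, and majority voting) can be sketched as follows. To keep the example short, each base learner is a one-split decision stump rather than a full tree, and the data are hypothetical; this is an illustration of the idea, not the configuration used in the study.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Best single-threshold split on one feature (a depth-1 decision tree)."""
    majority, count = Counter(labels).most_common(1)[0]
    # Start from "always predict the majority class"; a split must beat it.
    best = (len(labels) - count, feature, None, majority, majority)
    values = sorted({row[feature] for row in rows})
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if row[feature] <= thr else right) != y
                         for row, y in zip(rows, labels))
            if errors < best[0]:
                best = (errors, feature, thr, left, right)
    return best

def stump_predict(stump, row):
    _, feature, thr, left, right = stump
    if thr is None:  # no useful split was found in the bootstrap sample
        return left
    return left if row[feature] <= thr else right

def fit_forest(rows, labels, n_trees=15, k=1, seed=0):
    """Bagging of stumps, each choosing among k randomly selected attributes."""
    rng = random.Random(seed)
    n, forest = len(rows), []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        sample = [rows[i] for i in idx]
        sample_y = [labels[i] for i in idx]
        candidates = rng.sample(range(len(rows[0])), k)   # random attribute subset
        forest.append(min((fit_stump(sample, sample_y, f) for f in candidates),
                          key=lambda s: s[0]))            # optimal attribute in the subset
    return forest

def forest_predict(forest, row):
    """Classification by majority vote over the trees."""
    votes = Counter(stump_predict(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature data with well-separated classes.
rows = [(0.1, 1.0), (0.4, 1.2), (0.7, 0.8), (0.9, 1.1),
        (5.1, 6.0), (5.3, 6.4), (5.8, 5.9), (6.0, 6.2)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
```

For regression, the vote in `forest_predict` would be replaced by the mean of the individual tree outputs.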
Extreme Gradient Boosting (XGBoost) is an efficient gradient boosting decision tree algorithm, which is an improvement on the original Gradient Boosting Decision Tree. As a forward additive model, its core is the Boosting idea, which integrates multiple weak learners into a strong learner: multiple trees make decisions together, each tree is fitted to the difference between the target value and the combined prediction of all previous trees, and the outputs of all trees are added up to obtain the final result. In this way, the performance of the whole model is improved [17].
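The residual-fitting idea can be illustrated with plain gradient boosting under squared-error loss. This sketch is not XGBoost itself, which additionally uses second-order gradient information and regularization; the data, the number of rounds and the learning rate are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """One-split regression tree: choose the threshold minimising squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]

def boost(xs, ys, n_rounds=5, learning_rate=0.5):
    """Each round fits a stump to the residuals left by all previous rounds."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean of the targets
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lm, rm = fit_regression_stump(xs, residuals)
        preds = [p + learning_rate * (lm if x <= thr else rm)
                 for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, history

# Hypothetical one-dimensional data with a step between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 0.9, 1.1, 3.0, 3.2, 2.9]
preds, mse_history = boost(xs, ys)
```

The training error recorded in `mse_history` does not increase from one round to the next, because every stump is a least-squares fit to the current residuals.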
Adaptive Boosting (Adaboost) is an iterative algorithm that trains different classifiers on the same training set and then combines these weak classifiers to form a stronger final classifier. The ensemble strategy is to increase the weight of the samples that were misclassified by the previous round of classifiers and to reduce the weight of correctly classified samples, so that the misclassified samples receive more attention from the following classifiers, and further weak classifiers are then generated. These weak classifiers are combined by weighted majority voting, in which the voting weight of a classifier with a small error rate is increased and that of a classifier with a large error rate is reduced, so that the latter plays a smaller role in voting [18].
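One round of this re-weighting can be written out explicitly. The sketch below follows the standard discrete Adaboost update and assumes the weighted error rate lies strictly between 0 and 1; the five equally weighted samples are hypothetical.

```python
import math

def adaboost_round(weights, correct):
    """One Adaboost re-weighting step.

    weights: current sample weights (summing to 1)
    correct: True where the weak classifier was right
    Returns the classifier's vote weight (alpha) and the updated sample weights.
    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - err) / err)  # small error rate -> large vote
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5
correct = [True, True, True, True, False]  # one of five samples misclassified
alpha, new_weights = adaboost_round(weights, correct)
```

After the update the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.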
Naive Bayes is one of the most widely used classification algorithms. It is a classification method based on Bayes' theorem and the assumption of conditional independence among features. The Naive Bayes algorithm uses the knowledge of probability statistics to classify sample data sets. It is characterized by the combination of prior probability and posterior probability, which avoids the subjective bias of using only prior probability, and also avoids the overfitting that arises from using sample information alone [19].
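The combination of prior and likelihood under the independence assumption amounts to a short calculation. The probabilities below are hypothetical and were chosen only to make the arithmetic easy to follow; they are not estimates from the study data.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Posterior over classes under the conditional-independence assumption.

    priors: {class: P(class)}
    likelihoods: {class: {feature: P(feature present | class)}}
    observed: {feature: True/False}
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature, present in observed.items():
            p = likelihoods[cls][feature]
            score *= p if present else (1 - p)  # independent features multiply
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Hypothetical probabilities (illustration only).
priors = {"stroke": 0.1, "no_stroke": 0.9}
likelihoods = {
    "stroke": {"prior_stroke": 0.5, "vasopressors": 0.6},
    "no_stroke": {"prior_stroke": 0.1, "vasopressors": 0.3},
}
posterior = naive_bayes_posterior(
    priors, likelihoods, {"prior_stroke": True, "vasopressors": True}
)
```

Here the evidence outweighs the low prior: the unnormalized scores are 0.030 for stroke and 0.027 for no stroke, giving a posterior of about 0.53 for stroke.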
K-nearest neighbor (KNN) is one of the most basic and simplest machine learning algorithms. It can be used for classification and regression by measuring the distance between the feature values of different samples. Its working principle is to use the training data to partition the feature vector space and to take the partition result as the final model.
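A minimal sketch of distance-based classification is given below, assuming Euclidean distance and a majority vote among k = 3 neighbors; the training points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training data forming two clusters.
train_x = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = [0, 0, 0, 1, 1, 1]

near_first_cluster = knn_predict(train_x, train_y, (1.5, 1.5))
near_second_cluster = knn_predict(train_x, train_y, (6.5, 6.5))
```

Because the prediction depends on raw distances, features measured on different scales are usually standardized before KNN is applied.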
Categorical boosting (Catboost) is a gradient boosting algorithm library that handles categorical features well. It makes some improvements on the basis of the original Gradient Boosting Decision Tree. Specifically, the algorithm has two characteristics, adaptive learning rate and categorical feature processing. The former helps the algorithm better control the contribution of the weak learner in each iteration, and the latter allows categorical features to be handled efficiently and reasonably, so that their influence on the model is better captured [20].
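The text above does not spell out how categorical features are processed. One approach described in the CatBoost documentation is the ordered target statistic, in which each categorical value is replaced by a smoothed mean of the targets of earlier rows in the same category. The sketch below is a simplified illustration under that assumption; the function name, the smoothing parameter `a`, the prior and the data are all hypothetical, and the real library additionally averages over random permutations of the rows.

```python
def ordered_target_statistic(categories, targets, prior, a=1.0):
    """Encode each categorical value using only the targets of earlier rows.

    For row i: (sum of targets of earlier rows in the same category + a * prior)
               / (number of such rows + a)
    """
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + a * prior) / (n + a))
        sums[cat] = s + y    # the row's own target is added only afterwards
        counts[cat] = n + 1
    return encoded

# Hypothetical ethnicity column and binary outcome.
categories = ["White", "Unknown", "White", "White", "Unknown"]
targets = [1, 0, 0, 1, 1]
encoded = ordered_target_statistic(categories, targets, prior=0.5)
```

Using only earlier rows keeps a row's own outcome out of its encoding, which limits target leakage during training.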

Statistical analysis
Mean ± SD was used to describe the distribution of normally distributed measurement data, and the t-test was used to compare the difference between groups. Median and quartiles were used to describe the distribution of measurement data that did not follow a normal distribution, and the Wilcoxon rank sum test was used to compare the difference between groups. The enumeration data were expressed as number and percentage of cases [n (%)], and the Chi-square test or Fisher's exact probability test was used to compare the differences between the groups. Variables with <20% missing values were handled by random forest interpolation, and those with ≥20% missing values were deleted (Table 1). Sensitivity analysis was performed before and after interpolation (Table 2). All the data were randomly split into the training set (n = 4029) and testing set (n = 1728) at a ratio of 7:3 with a random seed of 42 [21]. Pearson correlation analysis and least absolute shrinkage and selection operator (LASSO) regression were applied for feature screening. Variables with Pearson correlation coefficient <0.9 were included in the LASSO regression, in which the regression coefficients of uninformative variables were set to 0. Features more closely related to the outcome were selected by 10-fold cross-validation, and features with non-zero coefficients were retained and included in the final model. Eight prediction models were constructed, and the parameter settings are shown in Table 3. The predictive values of the models were evaluated by sensitivity, specificity, area under the curve (AUC), accuracy, and 95% confidence interval (CI). The significance level was set at alpha = 0.05. Missing value interpolation, training set and testing set split, data modeling and result visualization were completed using Python 3.9.12. Sensitivity analysis and difference comparison were performed using SAS 9.4 (SAS Institute Inc., Cary, NC, USA).
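The 7:3 split with a fixed seed can be reproduced in outline as follows. The exact splitting routine used in the study is not stated, so this standard-library sketch (with the hypothetical helper `split_train_test`) only illustrates how a seeded shuffle of 5757 records yields the reported set sizes; the particular records assigned to each set will differ from the study's.

```python
import random

def split_train_test(n_samples, train_fraction=0.7, seed=42):
    """Shuffle indices with a fixed seed and cut them at the training fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)  # 5757 * 0.7 = 4029.9 -> 4029
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_train_test(5757)
```

Fixing the seed makes the partition repeatable, so the same patients fall in the training and testing sets on every run.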

Comparisons of the characteristics of participants with and without postoperative stroke in the training set
A total of 6289 CAD patients undergoing coronary revascularization were identified in MIMIC-IV. Among them, patients with a length of ICU stay of less than 24 h were excluded (n = 532). Finally, 5757 participants were included. All patients were divided into the postoperative stroke group (n = 433) and the postoperative non-stroke group according to whether postoperative stroke occurred. The screening process of the participants is exhibited in Fig 1.
The percentage of patients with a personal history of stroke was higher in those with postoperative stroke than in those without (15.86% vs 5.35%; P<0.001). The mean INR in patients with postoperative stroke was higher than in those without (1.41 vs 1.37; P = 0.001). The median length of stay in patients with postoperative stroke was longer than in those without (2.31 days vs 1.83 days; P<0.001). More detailed information is shown in Table 4.

Construction and the predictive values of the prediction models for the risk of stroke in CAD patients receiving coronary revascularization
Initially, 40 features were included, and 45 features were identified after one-hot encoding during the discretization of categorical features. There were 39 variables with Pearson correlation coefficient <0.9. In order to ensure the stability and efficiency of features, valuable feature sets were selected from the 10-fold cross-validation results. As λ gradually increased from 10^-10 to 10^10, the number of variables entering the model decreased. When λ was 0.002984, the LASSO regression model showed the best prediction performance. Finally, 20 features with non-zero coefficients were retained, which were age, GCS, CCI, weight, heart rate, DBP, SpO2, platelet, creatinine, INR, PTT, BUN, glucose, calcium, Ethnicity-unknown, Ethnicity-White, personal

The predictive values of the prediction models for the risk of stroke in CAD patients undergoing coronary revascularization
The predictive values of the prediction models for stroke in CAD patients undergoing coronary revascularization are presented in Table 5. The Catboost model presented the best predictive performance, with an AUC of 0.831 (95%CI: 0.811-0.851) in the training set and 0.760 (95%CI: 0.722-0.798) in the testing set. The AUC of the logistic regression model was 0.789 (95%CI: 0.764-0.814) in the training set and 0.731 (95%CI: 0.686-0.776) in the testing set (Table 6). The results of the Delong test revealed that the predictive value of the Catboost model was significantly higher than that of the logistic regression model (P<0.05). The ROC curves of the machine learning models in the training set and testing set are shown in Figs 4 and 5, respectively. The importance of each feature in the Catboost model is displayed in Fig 6, which shows that CCI was the most important variable associated with the risk of stroke in CAD patients undergoing coronary revascularization.
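For readers less familiar with the evaluation metrics reported here, the sketch below computes sensitivity, specificity, accuracy and AUC on a small set of hypothetical labels and scores. The AUC is obtained as the proportion of positive-negative pairs in which the positive sample receives the higher score; confidence intervals and the Delong test are not covered.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy from binary labels and predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }

def auc(y_true, scores):
    """Share of positive-negative pairs ranked correctly (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # 0.5 cut-off is an assumption

metrics = confusion_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

Unlike sensitivity and specificity, the AUC does not depend on the chosen cut-off, which is why it is convenient for comparing models.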
The SHapley Additive exPlanations (SHAP) values of features in the Catboost model were visualized in Fig 7, with SHAP values on the X-axis, features on the Y-axis, and each point representing a sample. Redder points indicate higher values of the feature, while bluer points indicate lower values. CCI was an important factor that exhibited a positive correlation with the risk of stroke in CAD patients following coronary revascularization. Creatinine levels were found to be associated with the risk of stroke, as indicated by blue dots primarily concentrated in areas where SHAP values exceeded 0, suggesting that lower creatinine levels were linked to higher stroke risk. When the SHAP value exceeded 0, the risk of postoperative stroke was increased. CCI ≥2 was associated with an increased risk of stroke.
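The meaning of a SHAP value can be illustrated by computing exact Shapley values for a toy model. The sketch below averages each feature's marginal contribution over all feature orderings, using a single baseline point in place of the background data. The linear risk score, its weights and the patient values are hypothetical and unrelated to the fitted Catboost model, and practical SHAP implementations for tree models use faster dedicated algorithms.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]  # switch feature i from baseline to its actual value
            value = model(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]

def toy_model(features):
    """Hypothetical additive risk score (not the study's model)."""
    cci, creatinine, age = features
    return 0.3 * cci - 0.2 * creatinine + 0.01 * age

patient = [4.0, 0.8, 70.0]    # CCI, creatinine, age
baseline = [2.0, 1.0, 65.0]   # reference values standing in for the average patient
phi = shapley_values(toy_model, patient, baseline)
```

The values sum to the difference between the patient's score and the baseline score. In this toy model, creatinine below the baseline receives a positive value because its weight is negative, which mirrors the pattern of blue dots at positive SHAP values described above.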

Discussion
The present study constructed several prediction models for the risk of stroke in CAD patients who received coronary revascularization based on machine learning methods. The results demonstrated that the Catboost model was the optimal model for predicting the risk of stroke in CAD patients who received coronary revascularization. The AUC of the Catboost model was 0.831 in the training set and 0.760 in the testing set, both of which were higher than those of the logistic regression model. The findings might provide a novel and quick tool to identify CAD patients receiving coronary revascularization who were at high risk of stroke, and allow timely interventions to prevent poor prognosis. Previously, several prediction models were constructed to predict the risk of cardiovascular events in CAD patients receiving coronary revascularization. Zhang et al. built a nomogram for predicting major adverse cardiovascular events after PCI in coronary heart disease patients with chronic kidney disease, and the AUC value of the model was 0.612 [22]. Another prediction model for predicting major adverse cardiac and cerebrovascular events among high-risk myocardial infarction patients undergoing primary PCI had an AUC of 0.883 in the testing set [23]. A very early prediction model for stroke in patients undergoing CABG had an AUC of 0.70 [24]. A multicenter Spanish study established a multivariate prediction model for perioperative in-hospital cerebrovascular accident after coronary bypass surgery, and the AUC was 0.77 [25]. The present study constructed several prediction models using machine learning methods, and the Catboost model had the optimum predictive value for the risk of stroke in CAD patients who underwent coronary revascularization. The prediction model can handle irregular data, missing values and other problems well, has good robustness, and effectively prevents overfitting, which also makes the model more general; in addition, it can match any advanced machine learning algorithm in terms of model performance [20]. The prediction model might provide an easy tool for clinicians to quickly identify CAD patients undergoing coronary revascularization who were at high risk of stroke. The success of deep learning and machine learning has brought excitement and high expectations of revolutionary changes in health care for CAD patients [26][27][28]. Deep learning and machine learning algorithms could achieve more accurate results and outperform statistical methods. The findings of this study might be of interest to researchers from other fields.
CCI is a measure of comorbidity burden that facilitates the evaluation of the prognostic significance of various clinical conditions based on their quantity and individual prognostic impact [29]. CCI has been extensively investigated in various clinical conditions and its significance as a prognostic indicator has been demonstrated. A previous study showed that CCI was higher in patients with a more diffuse extent of CAD than in those with milder disease [30]. Rashid et al. indicated that CCI >2 significantly increased the risk of mortality in acute coronary syndrome [31]. CCI was also identified to be a predictor of readmission in CAD patients [32]. CCI was reported to be highly associated with long-term survival and almost equivalent to left ventricular ejection fraction [33]. CCI was independently associated with an increase in 30-day, 1-year and 5-year cardiac death and major adverse cardiovascular events [34]. Other studies also revealed that CCI was a reliable indicator for the mortality of ischemic stroke patients [35][36][37]. Herein, CCI was found to be a vital predictor for the risk of stroke in CAD patients who received coronary revascularization. Age was identified to be an important variable associated with the outcomes of post-ST-segment elevation myocardial infarction patients among patients without preexisting coronary artery disease [38]. In this study, age was also related to the risk of stroke in CAD patients who received coronary revascularization.
Several prediction models for stroke risk in CAD patients with coronary revascularization treatments were constructed based on a variety of machine learning methods, which might provide certain references for early identification of high-risk patients and management of postoperative complications. Some limitations existed in this study. Firstly, although all the data were divided into the training set and testing set, the samples were from a single center, and the model should be applied with caution. Secondly, due to the limitations of the MIMIC database, some preoperative and postoperative data, and data related to liver enzymes, could not be obtained. More studies are required to verify the results of this study.

Conclusions
The current study established several prediction models and identified an optimal model for the risk of stroke in CAD patients who received coronary revascularization. The prediction model might offer a quick tool to identify CAD patients receiving coronary revascularization who were at high risk of stroke, and help to formulate specific treatment strategies to prevent the occurrence of stroke.

Fig 2. The changes of MSE with Lambda in the Lasso regression. https://doi.org/10.1371/journal.pone.0296402.g002

Fig 8 depicted the SHAP value analysis of each sample in the Catboost model, where blue represents negative feature contribution and red indicates positive contribution. The length of an arrow signifies the degree of influence that a feature has on the output, and its reduction or increase can be observed through the scale value on the X-axis. Base value refers to the average output of the model on the training data, while the

stroke-yes, beta-blockers-yes, calcium channel blockers-yes, and vasopressors-yes. Fig 2 presented the changes of mean-squared error (MSE), and Fig 3 showed the changes of coefficients with Lambda in the Lasso regression.
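The way coefficients change with Lambda in the LASSO regression can be illustrated with the soft-thresholding rule. The sketch assumes an orthonormal design and the objective (1/2)||y - Xb||^2 + λ||b||_1, under which each LASSO coefficient is the unpenalized coefficient shrunk toward 0 by λ. The coefficients and penalty values below are hypothetical and are not the fitted values from the study.

```python
def soft_threshold(beta, lam):
    """LASSO solution for one coefficient under an orthonormal design."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

# Hypothetical unpenalized coefficients for four features.
ols_coefficients = {"CCI": 0.40, "age": 0.15, "INR": -0.05, "weight": 0.01}

def selected_features(lam):
    """Features whose coefficient stays non-zero at penalty lam."""
    shrunk = {name: soft_threshold(b, lam) for name, b in ols_coefficients.items()}
    return [name for name, b in shrunk.items() if b != 0.0]

# Larger penalties leave fewer features with non-zero coefficients.
path = {lam: selected_features(lam) for lam in (0.0, 0.02, 0.1, 0.5)}
```

In practice the penalty is chosen by cross-validation, for example by taking the value with the lowest mean-squared error, and the features with non-zero coefficients at that value are retained.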